Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems

Authors

  • Mehdi Lotfi
  • Seyed Ahmad Motamedi
Abstract

Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing that uses two-level checkpointing. The first level is local checkpointing: computing nodes save their checkpoints to local disk. If a transient failure occurs in a computing node, the process can recover from the local disk. The second level is global checkpointing: computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs in a computing node, the node can no longer be used, and the process recovers from global storage on a new computing node. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each local checkpointing interval, the system estimates the expected recovery time in the case of a permanent failure and adaptively either takes a global checkpoint or skips it. Experimental results show that the proposed method significantly reduces the average execution time of the NAS-BT application; the maximum reduction in execution time for this application is 38%.
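
The adaptive step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual decision rule, so everything below is an assumption made for illustration: the names `CheckpointPolicy`, `expected_rework_time`, `should_take_global`, and `schedule` are hypothetical, permanent failures are assumed to follow an exponential model, and a global checkpoint is taken only when the expected rework after a permanent failure exceeds the assumed cost of writing to global storage.

```python
import math
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical parameters; none of these names come from the paper."""
    permanent_failure_rate: float  # assumed permanent failures per second
    global_checkpoint_cost: float  # assumed seconds to write to global storage
    local_interval: float          # seconds between local checkpoints


def expected_rework_time(policy: CheckpointPolicy, unsaved_work: float) -> float:
    """Expected seconds of recomputation if a permanent failure hits
    during the next local interval (assumed exponential failure model)."""
    p_fail = 1.0 - math.exp(-policy.permanent_failure_rate * policy.local_interval)
    return p_fail * unsaved_work


def should_take_global(policy: CheckpointPolicy, unsaved_work: float) -> bool:
    """Assumed rule: go global only when the expected rework outweighs
    the cost of the global checkpoint itself."""
    return expected_rework_time(policy, unsaved_work) > policy.global_checkpoint_cost


def schedule(policy: CheckpointPolicy, n_intervals: int) -> list[str]:
    """Decide, at the end of each local interval, whether to also take a
    global checkpoint ("global") or to skip it ("local")."""
    decisions = []
    unsaved_work = 0.0  # work not yet protected by a global checkpoint
    for _ in range(n_intervals):
        unsaved_work += policy.local_interval
        if should_take_global(policy, unsaved_work):
            decisions.append("global")
            unsaved_work = 0.0
        else:
            decisions.append("local")
    return decisions


if __name__ == "__main__":
    policy = CheckpointPolicy(
        permanent_failure_rate=1e-5,
        global_checkpoint_cost=10.0,
        local_interval=600.0,
    )
    print(schedule(policy, 6))
```

With these illustrative numbers the sketch skips two global checkpoints and takes the third, because the unprotected work has to accumulate before a global write pays off. The method in the paper may weigh different quantities.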


Similar resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...


Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...


On the Impossibility of Min-Process Non-Blocking Checkpointing and An Efficient Checkpointing Algorithm for Mobile Computing Systems

Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Prakash and Singhal [14] proposed the first coordinated checkpointing algorithm for mobile computing systems. However, we showed that their algorithm may result in an inconsiste...


A New High Performance Checkpointing Approach for Mobile Computing Systems

In this paper, we present a single phase non-blocking coordinated checkpointing algorithm suitable for mobile computing environments. The distinct advantages that make the proposed algorithm suitable for distributed mobile computing systems are the following. It produces a consistent set of checkpoints, without the overhead of taking temporary checkpoints; the algorithm makes sure that only min...


Anti-message Logging Based Coordinated Checkpointing Protocol for Deterministic Mobile Computing Systems

A checkpoint algorithm for mobile computing systems needs to handle many new issues like: mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power and high failure rate of mobile nodes. These issues make traditional checkpointing techniques unsuitable for such environments. Minimum-process coordinated checkpointing is an attract...



Journal:
  • J. Inf. Sci. Eng.

Volume: 26


Publication date: 2010